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Summary: We consider Bayesian analysis of a class of multiple changepoint models. 
While there are a variety of efficient ways to analyse these models if the parameters associ- 
ated with each segment are independent, there are few general approaches for models where 
the parameters are dependent. Under the assumption that the dependence is Markov, we 
\q \ propose an efficient online algorithm for sampling from an approximation to the posterior 
distribution of the number and position of the changepoints. In a simulation study, we show 
that the approximation introduced is negligible. We illustrate the power of our approach 
through fitting piecewise polynomial models to data, under a model which allows for either 
continuity or discontinuity of the underlying curve at each changepoint. This method is 
competitive with, or out-performs, other methods for inferring curves from noisy data; and 
uniquely it allows for inference of the locations of discontinuities in the underlying curve. 
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Changepoint models are commonly used for time-series, to allow for abrupt changes in the 
underlying model or structure for the data. Some example applications areas include ge- 
netics (Liu and Lawrence, 1999; McVean et ai, 2004), environmental time-series (Dobigeon 
and Toumeret, 2007; Seidou and Ouarda, 2007), and signal processing (Punskaya et ai, 
^ j 2002), amongst many others. 



We consider Bayesian inference for changepoint models. Existing methods for such in- 
ference are either based on MCMC approaches (e.g. Stephens, 1994; Chib, 1996, 1998; 
Lavielle and Lebarbier, 2001), or methods for direct simulation from the posterior (see e.g. 
Yao, 1984; Barry and Hartigan, 1992; Liu and Lawrence, 1999; Fearnhead, 2008). The 
methods for direct simulation have the advantage over MCMC of producing iid draws from 
the posterior, and they can also be implemented efficiently so that their computational 
cost is linear in the number of observations (Fearnhead and Liu, 2007). However they are 
limited in terms of the class of models that can be considered. If we call the period of time 
between two successive changepoints a segment, then direct simulation methods require 
the parameters associated with each segment to be independent of each other, and that 
the marginal likelihood for the data within each segment can be calculated analytically (or 
numerically, see Fearnhead, 2006). 
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One implementation of the direct simulation methods is based on solving filtering recursions 
(Fearnhead and Liu, 2007). We process the observations one at a time, and when processing 
the observation at a time t say, we calculate the posterior distribution of the time of the 
most recent changepoint prior to t. Here, we extend this direct simulation methods to 
models where there is dependence across segments. We assume that the dependence is 
Markov, so that parameters in the current segment depend only on the parameters in 
the previous segment. The assumption of dependence across segments greatly increases 
the complexity of calculating the posterior distribution, and to develop a computationally 
efficient algorithm we introduce a simple approximation. At time t we approximate the 
distribution of the parameters associated with a new segment, conditional on a changepoint 
at t. While this conditional distribution is a mixture distribution, with the number of terms 
in the mixture increasing exponentially with t, we approximate the mixture by a single 
distribution. This approximation leads to an efficient algorithm, but one that produces iid 
samples from an approximation to the posterior distribution of interest. 

We demonstrate our new method on the problem of fitting piece-wise polynomial models. 
Here dependence across segments arises due to assumptions of continuity of the underlying 
curve. Existing methods for this problem include the MCMC methods of Denison et al. 
(1998) and DiMatteo et al. (2001), who also sample from an approximation to the posterior 
of interest, from approximating the marginal likelihood associated with each segment. Our 
model extends existing models that are considered, by allowing for the possibility of either 
continuity or discontinuity of the curve at each changepoint. Our approach also allows for 
online analysis of time-series. 

The outline of the paper is as follows. Firstly we introduce the class of changepoint models 
we consider. Then in Section 3 we develop out algorithm for Bayesian inference for these 
models. Section 4 then analyses the resulting algorithm for the specific application of fitting 
piece-wise polynomial models. We first show that the approximation introduces negligible 
error when analysing simulated data from the true model. We also compare the resulting 
method with both wavelet-based methods and the MCMC method of Denison et al. (1998), 
and look at the power of the method for detecting discontinuities in the underlying signal. 
Section 5 applies our method to analysing well-log data. Here the focus of inference is 
in detecting changepoints where the underlying signal is discontinuous. Finally the paper 
ends with a discussion. 



2 Changepoint model 

We consider the following hierarchical model for observations yi :n = (y±, . . . ,y n )- Firstly 
we introduce a model for the number, /, and position, < T\ < ■ ■ ■ < 77 < n, of the 
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changepoints. This is based on a distribution for the distance between two successive 
changepoints 



p{r k - r fc _i 



d) = g(d), 



(1) 



for some discrete distribution g(-) on the positive integers. We define To = and r^ +1 = n, 
and we let G(s) = Yld=i 9(d) be the corresponding cumulative distribution function. We 
assume independence of the distance between different pairs of successive changepoints, so 
that the joint probability of I specific changepoints is 



The changepoints split the data into / + 1 segments, with the kih segment containing 
observations y Tfe+1 . Tfc+1 , for k — 0, . . . , I. For segment k we associate a model M k and a 
vector of parameters 6 k- The model is drawn from a finite set of possible models, M. and 
we assume that there is independence of the choice of model across different segments. 
(Extension to the case where the model of a segment depends on the model of the previous 
segment is possible, see Fearnhead and Vasileiou, 2009). 

For k > 1 we allow the distribution of 6k to depend on the position of segment k — 1, T k -\ 
and Tfc, and its parameter 6k-\- Thus we have that the conditional probability of the model 
and parameters for the segment can be factorised as 



For the first segment we assume a prior for 6 . Note that this framework includes mod- 
els where there are common parameters across segments. In this case some components 
of 6k are equal to the equivalent components of 6k-i and the conditional probability 
Pm(6k\6k-iT~k,Tk-i) in only non-zero for parameter combinations that obey this constraint. 

Given a segment defined by changepoints at positions s and t, and with model m and 
parameter 6 we have a likelihood model 



We assume that conditional on the changepoints, segment models and parameters, the 
observations within each segment are independent of each other. 

Finally we assume that there exists a family of conjugate priors for 6, p m {6\()- Thus for 
all m, ( and y t and s, t, we can calculate 




k=i 



Pr(M k = m)p m (6k\6k-ir k ,Tk-i). 



p m (y s +i:t\6). 



(2) 




(3) 
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where 

/ \n \ Pm(y S +l:t\0) 

p m {yt\0,s) = 



Pm(y S +l:t-l\0) 

is the probability density of y t given a segment that started with observation y s +i- Fur- 
thermore, conjugacy imples that there exists a (' such that 

Pm(e\(')<XPm(yt\e,s)p m (e\(), (4) 

where the constant of proportionality is defined so that the right-hand side integrates to 1 
(with respect to 9). We denote the value of (' defined by (4) by an update function u s : 

C = u s (t,m,()- (5) 
This update function (and hence (') will depend on the data y s +i-.t- 

We now give an example of such a changepoint model, which will be used throughout the 
paper to demonstrate and make concrete the ideas we present. 

Example: Piecewise Polynomial Regression 

We consider filtering a piecewise polynomial regression model to bi-variate data (xi,yi) for 
i = 1, . . . , n, with the data ordered so that x 1 < x 2 < ■ ■ ■ < x n . For concreteness we will 
focus on piecewise quadratic models, but the extension to polynomials of different order is 
st raight forward . 

If the observations y s +v.t are in the kth. segment, we specify the model of (2) by: 

y s+1:t = H k (3k + £fc, (6) 
where the design matrix is of form 



/ 1 \ 

1 X s+ 2 — X s+ \ (x s+ 2 — X s+ i) 2 

\1 x t -x s+1 (x t -x s+ if j 



Ek is a vector of noises that are independently drawn from a N(0, a 2 ) distribution, and 
Pk — (flk,o, Pk,i, fik,2) is a vector-valued regression parameter. 

For simplicity, we model the distance between successive changepoints as geometric with 
mean 1/p, so g(d) = p(l —p) d ~ l . For each segment except the first we allow for one of two 
models: M — 1 refers to the underlying curve being discontinuous at the changepoint that 
starts the segment, and M = 2 refers to the curve being continuous at this changepoint. 
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Our prior is that the model of each segment is equally likely to be either possibility. Note 
that if M — 2 then (3 k;0 will be determined by the length and parameters of the previous 
segment. 

We assume that a 2 is common to all the segments. However, to be consistent with the 
above framework, we introduce o\ to denote its value in the /cth segment. Thus we have 
that 6k = (o"fc,/3fc) and 6k depends on 6 k -i as a k = cr^-i, and if M — 2 through the 
dependence of j3 k fi on flk-i- 

We use the following standard conjugate priors for the variance a\ and the regression 
parameter (3 k for both M = 1,2: 

a\ ~ 10(^/2,7/2), 
(3 k \a 2 k ~ N(/x,^D), (7) 

where IG denotes the inverse Gamma distribution and N denotes the Gaussian distribution. 
With the notation above, we have ( = (u, 7, fx, D). For the first segment, for which M = 1, 
we have prior parameter £ ,i = (vo, 7o, 0, D ), with D = diag(<5o, Si, 62)- For a future 
segment k with M k = 1, we have the distribution for l3 k given by (7) with fj, = (0, 0, 0) 
and D = D . For a segment k with = 2, and the previous segment starting with 
observation x r+ i and ends with observation x s , the distribution for (3k is given by (7) with 
At = (/3fc-i,o + APk-1,1 + A 2 /3 fc _ lj2 , 0, 0), where A = (x s+ i - x r+1 ), and D = diag(0, S u S 2 ). 
This prior distribution ensures continuity of the underlying curve. 

We can calculate P s (t, m, () and u s (t, m, () (see Equations 3 and 5) using standard updates 
for dynamic linear models (West and Harrison, 1989); details are given in the Appendix. 
Given the changepoint positions and segment models, we have a linear model for our data, 
and due to the choice of priors we can simulate directly from the posterior distribution 
of the parameters. The difficulty with Bayesian inference for this model is due to the 
intractability of the posterior distribution for changepoint positions and segment models. 



3 Approximate Inference 

We now describe our method for drawing, approximately, from the posterior distribution 
of the number and position of changements, and model and parameter values for each 
segment. The approach is based on recursive filtering and smoothing algorithms, which we 
will describe in turn. Throughout our description we will introduce a (potentially artificial) 
time, with observation y t arriving at time t. For ease of presentation it will be useful to 
refer to the model and parameter values associated with the segment to which y t belongs. 
Hence, for the rest of the paper we will slightly change notation, with 6 t and M t refering 
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to the parameter and model value at this time t. That is we will subscript by time rather 
than by segment. We also introduce a new variable, C t , which will denote the position of 
the most recent changepoint prior to time t. 

3.1 Filtering Algorithm 

To simplify the following exposition we will first derive the filtering algorithm for the case 
of a geometric segment length, g(d) = p(l —p) d ~ l . Presentation of the algorithms we derive 
(Algorithms 1 and 2), include the details for a general segment length distribution. 

First note that (C t , M t ,9 t ) are a Markov process; and in particular the marginal dynamics 
for C t , M t are given by 

{1 — p if j = i and m = ml , 

pPr(M = m) i£j = t,meM, 
otherwise. 

The top probability refers to there not being a changepoint between y t and yt+i, and the 
middle probability refers to the event that there is. 

Now we wish to recursively approximate 

p(C t , M t , e t \y 1:t ) = p(C t , M t \y 1:t )p(e t \y 1:t , C t , M t ). 

The first term on the right-hand side is a discrete distribution, and we approximate p(Ct = 
s,M t = m\yi : t) ~ w[ s ' m \ Whereas for given Ct — s and M t = m we will approximate 
p(9t\yi-.t, C t , M t ) by p m {9t\(t S ' m ^), for some (t S ' m \ Our approximation is specified by the set 
of probabilities w^'^ and parameters (j: S '"^ for s = 0, — 1 and m G M. the set of 
possible models. 

We initiate our algorithm using the model prior, with w^ ,m " > = Pr(M = m) for m G M., 
and prior for the parameters Co°'"^ = Co,m- For t — 1, . . . ,n we have the following set of 
recursions. Firstly for s G {0, . . . ,t — 1}, C t +\ = s means that there is no changepoint at 
time t. Thus we have M t +i = M t and 9 t +i = 9 t , and 

p(C t+1 = s, M t +i = m, t+1 = 9\y 1:t+1 ) = 

Kp(C t = s,M t = m, 6 t = 9\y ht ) Pr(C m = s\C t = s)p m (y t+1 \9) , 

for some normalising constant K. Now we substite our approximation, p(C t = s, M t = 
m,9 t+ i\yi:t) ~ Wt S ' m ^p(9\(t S ' m ^). Integrating with respect to 9 gives Pr(C t = s,M t = 
m \yi:t+i), and thus 

<™ = Kw s r{\ - p)P s (t + l,m, (t' m) ) . 
6 



While, using the updates for the conjugate distribution for 9 we get 

p(9 t +l\yi:t+l, Ct+l = S, M t+1 = m) OC Pm{Ot+l\d S ' m) )Pm(yt+l\0t+l) = Pmifit+1 1 Ct+1* > ) » 

forCftr^^ + l^,^). 

Now consider Ct+i = i. This corresponds to a changepoint at time t, and Cj can take any 
value in {0, . . . , t — 1}. We derive an approximate recursion by considering 

p(C t+1 = t, M t+1 = m, 9 t +i\yi:t) = 
t-i 

Yl Yl P(C t = s,M t = m',9 t \y 1:t )Pr(C t+1 = t\C t = s)Pr(M = m)p(9 t+1 \9 t ,t,s), 

s=0 m'eM 

where p{9t+\ \9t, t, s) denotes the conditional distribution of 9t+i given Q t and that the 
previous segment contained observations y s +i-.t- Now, substituing our approximations to 
p(C t = s,M t = m', 9 t \y 1:t ) we have 

t-i 

p(9t + i\yi-.t,C t+1 = t,M t+1 =m)^J2Yl Pwl S ' m \U9t\(l s ' m \(9 t+ i\9 t ,t,s). (8) 

Our approach is to approximate this by p m {9 t +i\C t f' rr ^) for some suitable choice of Cl*'" 1 ''- 
Thus as 

p(C t+ i = t,M t+1 = m,9 t+1 \y ht+1 ) = Kp(C t+1 = t,M t+1 = m,9 t+1 \y ht )p m (y t+1 \9 t+1 ) 
= Kp(C t+ i = t,M t+ i = m\y 1:t )p(9 t+1 \y 1:t ,C t+1 = t,M t+1 = m)p m (y t+ i\9 t+1 ) , 

we get the approximate recursion 

wf^ = K Pr(M = m)P t {t + 1, m, ' m) ) £ £ w^p, 

s=0 m'eM 

and<gf ) =ti l (t + l,m,C e ( ^). 

Note that the only approximation in our filtering recursions is in the approximation of 
(8). There are various ways of choosing C^' m ^ for this approximation, and in practice we 
use a simple method of moments approach (see below). Note that this approximation is 
required to avoid the exponentially increasing computational cost of the exact filtering 
recursions. Similar approximations have been used in the Generalised Pseudo-Bayes al- 
gorithm (Tugnait, 1982), or the Interacting Multiple Model filter (Blom and Bar-Shalom, 
1988). 



7 



Algorithm 1 Filtering Algorithm 



Initiate Set wf' m) = Pr(M = m)P s (l, m, ^°' m) ) and C}°' m) = u 8 {l, m, C<S°' m) ) for m G M. 
Normalise weights, wf' m ^ and let t — 1. 

While t < n (i) For s — 0, . . . , t — 1 and m e M, set 

( s m ) 1 - G(t + 1 - S) ( s m ) / A (s,m)\ 

= ! _ g ( t _ s ) w * P * + X > m ' & J > 

and C t +r ) =«,(* + 1, m, C t (s ' m) - 

(ii) For m G .M, calculate Ct*'"^ to produce the approximation to (8). 

(iii) For m G M, set 



(t,m) 



= Pr(M = m)P t (t + 1, m, ,m) ) £ E ^ ( 



5 =0 m'£M 

andCSr^^ + l^,^)- 
(iv) Normalise weights, 



G(t + 1 - s) - G(t - s 
l-G(t- s) 
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The full filtering algorithm, allowing for a general distribution of segment lengths and prior 
distribution for models is described in Algorithm 1. 

Example Revisited 

We now give details of step (ii) of the algorithm for the piecewise polynomial regression 
model. Remember ( t = (v t , 7 t , fi t , D t ). For a new segment with M t+ \ = 1 we have m = 
and D t = D . We choose v t and j t to match moments of the predictive distribution of 

-2 

a t+v 

Assume uj: s,m ^ and ^ s,m ^ are the first two components of (j: s ' m \ Then we solve 

*-l 2 (s, m >) (t,l) 
Vfrr- 2 \^ \^ ... («."»' ^ _ U t 

ek+i) - 2. w * - ■ 

s=0 m'=l It It 



and 

t-l 2 (s,m'), n , (s,m')\ (M)/o , (*>lh 

e(^) = v v 4" m ' ^ ( T ' = (2 + y ' ' 

s=0 m'=l Ut J Wt J 



for z/f and 7f'^. 

ft 2) ft 2) 

For a new segment with M t+ i = 2, we have identical calculations for v\ ' and 7 t . 
However, in this case we have /x t +i = (77, 0, 0) and D t+ i = Diag(r, 5±, 62) for some r\ and r 
to be calculated. Again we choose values based on matching moments, this time of Pt+i,o- 

Let A s = (x t +i - x s+1 ), and a s = (1, A s , A 2 S ) T , then 

t-i 2 



s=0 m'=l 

and 

t-i 2 



s=0 m'=l 



3.2 Smoothing 

Once we have calculated the filtering distributions for all t, we can simulate, backwards in 
time, the number and position of changepoints, the segment models and parameters, given 
the full data yi :n . 
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Firstly, we can simulate (C n , M n , 9 n ) from (our approximation to) p{C ni M n , 9 n \yi:n 

). These 

will give us the start of the final segment, together with its model and parameter values. 
Assume we simulate C n = t, then we will next simulate (C t , M t , 9 t ) from 

p(C t , M t , 9 t \y 1:n , C t +i = t, C t+2 :n, M t+1:n , 9 t+1 . n ). 

This will give us the start of the penultimate segment, its model and parameter values. We 
can then repeat this backwards in time until we simulate the first segment for our data. 

To perform the simulation we use the fact that 

p(C t , M t , 9 t \y hn , C t+ i :n , M t+1:n , 9 t+1:n ) = p(C t , M t , 9 t \y 1:t , C t+1 , M t+1 , 9 t+1 ), 

by the conditional independence structure of the model. Thus we have 

p(C t = s,M t = m, 9 t \y 1:n , C t+1 = t, C t+2: „, M t+1 = m', M t+2:n , 9 t+1:n ) 
= p(C t = s,M t = m, 9 t \y 1:t , C t+1 = t, M t+1 = m', 9 t+1 ) 

oc p{C t = s,M t = m,9 t \y 1:t )p(C t+1 = t,M t=1 = m',9 t+1 \C t = s, M t = m,9 t ,y 1:t ) 

oc p(C t = s,M t = m, 9 t \y 1:t ) Pr(C t+1 = t\C t = s)p{9 t+l \C t+l = t, M t+1 = rri, C t = s,M t = m, 9 t ), 

where in the final step we have used that the model of a new segment is independent of 
the model of the preceeding segment. 

To simplify notation, let T t = {yt+i-.n, Ct+i-.n, M t+ i :n , 9 t+ i :n } denote the future of the process 
after time t. Now substituting p(C t — s,M t — m, 9 t \y 1:t ) = wl s ' m ' > p(9 t \(j: s ' m ^) we get 

Pr(C t = s,M t = m\y 1:t , T t ) oc w[ s ' m) Pr(C t+1 = t\C t = s) 
x J p(9 t \d s ' m) )p(9 t+1 \C t+1 = t, M t+l = m', C t = s,M t = m, 9 t )d9 t , (9) 

and 

p(9 t \C t = s,M t = m,y 1:U F t )cxp(9 t \(i s ' in) )p(9 t+1 \C t+1 = t,M t+1 = m',C t = s, M t = m,9 t ). 

(10) 

We need to be able calculate (or approximate) the integral in (9) and simulate from (10) 
to perform the smoothing. The full smoothing algorithm is given by Algorithm 2. The 
smoothing algorithm simulates the number and position of the changepoints, and the seg- 
ment models and parameters. Often more accurate results can be obtained by throwing 
away the simulated parameter values, and re-simulating these from their conditional distri- 
bution given the changepoints and segment models (assuming this distribution is tractable). 
Such an approach is possible for our piecewise polynomial regression example, and is what 
we used in the simulation studies later. 
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Algorithm 2 Smoothing Algorithm 



Initiate 1. Simulate (C n ,M n ) from the discrete distribution that gives probability 
Wn' m ^ to the value (s,m). Assuming (C n ,M n ) = (s,m), then simulate 9 n from 

p(0n\d S ' m) ). 

2. Set t = s,m' = m and 6 = 9 n . 
While t > I. For s — 0, . . . , t — 1 and mGM calculate 



X 



y p(^|Ct (s ' m) )p(^+i = 0\C t+1 = t, M t+1 = rri, C t = s,M t = m, 9 t )d6 t 



2. Simulate (C t , M t ) from the discrete distribution that gives probability propor- 
tional to w( s ' m ) to the value (s, m). 

3. Assume (C t ,M t ) = (s, to). Simulate 9 t from the distribution proportional to 

p(O t \tf' m) )p(e t+1 = 0\C t+1 = t, M t+1 = m', C t = s,M t = to, 9 t ). 

4. Set t — s, m! — to and 6 = 9 t +i- 
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We now give details of the calculations involved in the smoothing algorithm for our example. 
Example Revisited 

For our example 9 t = (a t ,/3 t )- Consider a changepoint at t, and C t = s. Define h = 
(1, A, A 2 ) and A = (x t+1 - x s+1 ). 

Firstly consider calculating an integral of the form 

J p(e t \() P (0 t+1 \C t+1 = t, M t+1 = m', C t = s,M t = m, O t )d0 t , 

where ( = (i/, 7, //, X>), for step 1 of Algorithm 2. For m! — 1 this becomes 

IG(<7 t+ i; u/2, 7 /2)N(A+i; 0, of +1 D ), 

where lG(x; a, b) denotes the probability density function (pdf) of an inverse-gamma dis- 
tribution with parameter a and b, evaluated at x; and N(x; 77, S) denotes the pdf of a 
multivariate normal distribution with mean /i and variance E, evaluated at x. The first 
term comes from the fact that a t+ i = a t , and second due to the independence of /3 t+ i and 
P t . For m' = 2, the integral becomes 

lG(a t+1 ; u/2, 7 /2)N(/3 t+1 ; V , <7 t 2 +1 E), 

where rj = (h/i T ,0,0), and E = Diag(h T Dh, S 1: 5 2 ). Here the conditional density for fi t+1 
has changed as now P t +i,o — due to continuity. 

Now consider simulating 9 t from a density proportional to 

p(9 t \()p(9 t+1 \C t+1 = t, M t+1 = m', C t = s,M t = m, 9 t ), 

in step 3 of Algorithm 2. For m! — 1 we set a t = Ct+i, and simulate f3 t from a multivariate 
normal distribution with mean // and variance cr 2 +1 D. For m' = 2 we again set a t = Ct+i, 
but now simulate f3 t from a multivariate normal distribution with mean /i and variance 
cr 2 +1 D conditional on h.f3j = Pt+i,o- Standard results (see e.g. Rue and Held, 2005), gives 
that we simulate (3 t from a multivariate normal with mean 

|i-Dh T (hDh r )- 1 (lvi T -A +1> o), 

and variance 

a t 2 +1 (D - Dh T (hDh T )- 1 h T D) . 
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3.3 Resampling 



Simulating from the posterior distribution of the number and position of changepoints, and 
the segment models and parameters, using the filtering and smoothing algorithms has a 
complexity which is quadratic in n. This is due to the number of support points of (C t , M t ) 
increasing linearly with t. 

At the expense of further approximation, we can develop an algorithm whose total com- 
putational cost is linear in n via using particle-filter resampling algorithms (e.g. Liu et al, 
1998; Fearnhead and Clifford, 2003) to approximate the distributions of (C t , M t ) by dis- 
crete distributions with fewer support points. (The resampling procedures ensure that the 
number of support points in the resulting approximation is bounded by a constant for all t.) 
This was investigated in Fearnhead and Liu (2007), who propose two optimal resampling 
algorithms for changepoint models, and show that substantial computational savings can 
be obtained with negligible approximation error. 



4 Simulation Study 

We now evaluate out method through a simulation study using the piecewise quadratic 
model introduced within our example. We first look at the accuracy of our filtering and 
smoothing method for simulating from the posterior distribution, and then compare the 
accuracy of our method to other approaches for curve-fitting. Finally we look at the 
accuracy of our method at inferring discontinuities in the underlying curve. 

In implementing our method we used the filter and smoothing algorithms with the stratified 
rejection control resampling method of Fearnhead and Liu (2007). The threshold param- 
eter within the resampling algorithm was set to 10~ 6 (see Fearnhead and Liu, 2007, for 
details). We used the filter and smoothing algorithms to simulate the number and position 
of changepoints, the value of the observation variance and the model for each segment. 
Conditioned on these, we then simulated the (3 values associated with each segment from 
their conditional distribution. 

The filtering and smoothing algorithms were implemented within C++ and R. The compu- 
tational cost of the algorithms is roughly linear in the number of observations, and to run 
them on a data set with 4000 data points took of the order of 10 seconds on a desktop PC. 
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(a) 



(b) 




~i 1 1 1 1 n n 1 1 1 1 1 

0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 

Uniform Quantiles Uniform Quantiles 



Figure 1: Posterior quantile plots of (a) the underlying curve and (b) a 2 for different values 
of n: 256 (red, dashed line), 512 (green dotted line) and 1024 (blue dot-dashed line). For 
(b) we give 90% confidence intervals obtained through simulation (black dashed line). 



4.1 Accuracy of the Simulation Method 

To test the accuracy of the filtering and smoothing algorithms at drawing samples from the 
true posterior distribution, we ran a simulation study where we simulated data under the 
exact model that we used for analysis. We then calculated the posterior quantiles of the true 
value for a and the value of the underlying curve at each time point. The rationale is that if 
we could draw from the true posterior, then these posterior quantiles should be uniformly 
distributed on [0,1]. Any inaccuracies in our simulation method will be demonstrated 
through deviations of the posterior quantiles from such a uniform distribution. 

We simulated data for the piecewise-quadratic model with D = Diag(l, 10 2 , 40 2 ), p = 4/n 
and a 2 = 1. We analysed the data under the model with the same value for Do and p, 
but with an improper prior for a 2 (equivalent to v = 7 = 0). To detect any affect that 
the amount of data had on the performance of our method we simulated 100 data sets for 
each of n = 256, 512 and 1024. In each case we used equally spaced x t points in [0, 1]. 
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Plots of the posterior quantiles are shown in Figure 1. In both cases they are close to that 
expected if they were drawn from the true posterior distribution. The extra smoothness 
in the plot of posterior quantiles of the underlying curve is due to the larger number of 
quantiles obtained in this case, lOOn as we obtain one quantile for each data point. For 
the posterior quantiles of a we are able to construct confidence intervals, as the posterior 
quantiles are independent. We notice that the observed quantiles generally lie within the 
plotted 90% confidence interval. Taken together, these results suggest that negligible error 
is being introduced by the approximations in our method for simulating from the posterior 
distribution. 



4.2 Comparison for curve-fitting 

We now look at the accuracy of our piecewise quadratic regression model, together with 
the new simulation method, for curve-fitting. Firstly, in order to implement our method we 
need to choose the prior parameter values. As above we will use the default uninformative 
prior for o. We will assume no prior knowledge of D and p, and use an empirical Bayes 
approach to estimate these hyper-parameters (as suggested in Fearnhead, 2005), whereby 
we estimate their values from the data. We did a preliminary analysis of the data (using 
default choices for Do and p) , and then estimated Do and p from the posterior distribution 
of the (3s and the number of changepoints. If necessary this could be repeated, with 
simulation from the posterior given the latest estimates for D and p, and new estimates 
of D and p obtained. 

For our simulation study we chose default value of p = 1/n and D = diag(10, 100 x 
10 2 , 1000 x40 2 ). These are substantially different from the true values used in the simulation 
(see above). For simplicity we did not repeat the iterative procedure just described. The 
effect of these choices are discussed below. 

We first quantify the accuracy of our method for analysing the same simulated data sets 
that were used in Section 4.1. For a given data set let z t denote the value of the underlying 
curve at time t (so observations are y t — z t + ae t where e t is a standard normal random 
variable). Denote by z t an estimate of z t , then we estimate the accuracy of an estimate of 
the curve z\- n by the average mean square error 



For our method we use the posterior mean as our estimate of z t . We also look at the mean 
point-wise coverage of 90% credible (or confidence) intervals for z t . 




n 



t=i 
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n 



MSE 

New BAYES.THR cthresh 



Coverage 
New wave. band 



256 
512 
1024 



0.056 0.215 0.15 
0.027 0.138 0.093 
0.014 0.087 0.056 



0.87 0.79 
0.87 0.79 
0.89 0.79 



Table 1: Mean square error (MSE) and coverage of putative 90% confidence/credible in- 
tervals for our new method, and wavelet based methods. 



For comparison we estimate the underlying curve using wavelets. We implement two 
wavelet methods, that of Abramovich et al. (1998) implemented using the BAYES .THR func- 
tion in R, and one using complex wavelets (Barber and Nason, 2004) implemented using 
the cthresh function in R. We also constructed wavelet-based confidence intervals (Barber 
et al., 2002) using the wave .band function in R. (See http : //www. stats .bris . ac . uk/~wavethresh/ 
for details of these functions; we used default settings for the R functions in all cases.) 

Results for the simulated data described in Section 4.1 are given in Table 1. We notice that 
the MSE for estimates of the underlying curve is substantially smaller for our new approach 
than for either wavelet method. Of the two wavelet methods, the one using complex 
wavelets gives superior performance. The MSE of our new method halves each time n is 
doubled, whereas the MSE of the wavelet methods decreases by a smaller proportion each 
time. Finally, the coverage of our 90% credible intervals are close to 90% in each case. The 
fact that the coverage of the intervals is less than their putative size is likely to be down 
to errors in estimating the hyperparameters. 

The choice of default starting values for p and D , used in the iterative empirical Bayes 
procedure, has a small effect on the results. For example for n = 256, repeating the 
analysis with default values of p = 10/n and p = 20/n increased mean square error by 
0.001 and 0.004 respectively. Increasing the default value for D , through scaling by a 
factor of 10 or 100, increased mean square error by 0.007 and 0.011 respectively. In these 
latter cases, our default value is substantially different from the truth, and we do see a 
non-negligible increase in mean square error. However we can avoid this by repeating the 
iterative procedure: for example in the last case repeating the procedure just 3 times leads 
to the same mean square error as reported in Table 1. 

The advantage of our method over a wavelet approach for these data is not suprising as the 
data was simulated under the model assumed by our method. To test robustness of this 
method to data being simulated from an alternative model, we repeated our simulation 
study but with data simulated under a piecewise cubic model. For this model we set 
D = diag(l, 10 2 , 40 2 , d 2 ), and considered the effect that d had. Note that the expected 
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d 



MSE 

New BAYES.THR cthresh 



Coverage 
New wave . band 



100 
200 
400 



0.06 0.34 0.16 
0.07 0.69 0.17 
0.11 2.45 0.18 



0.86 0.80 
0.86 0.82 
0.84 0.86 



Table 2: Mean square error (MSE) and coverage of putative 90% confidence/credible in- 
tervals for our new method, and wavelet based methods. Data simulated under a piecewise 
cubic model, with d affecting the size of the cubic co-efficients. All data sets were simulated 
with n = 100. 

value of the modulus of the cubic co-efficient is rf(2/7r) 1//2 . For simplicity we fixed n = 256 
for all simulations that we carried out. 

Results are given in Table 2, again based on 100 simulated data sets for each set of pa- 
rameters. As expected, as d increases, which corresponds to an increasingly non-quadratic 
components of the underlying curve, the performance of the new method deteriorates. This 
is both in terms of the coverage properties of the credible intervals, and the mean square 
error of estimates of the underlying curve. However for all values of d we considered, the 
new method still substantially out-performs both wavelet methods in terms of estimating 
the underlying curve. 

As a final comparison, we applied our new method to various test data sets from the 
literature, and compare our method with the published results of Denison et al. (1998) 
(henceforth DMS). The test data sets used are shown in Figure 2, and consist of the 
Heavisine, Blocks, Bumps and Doppler signals of Donoho and Johnstone (1994); and the 
smooth function (a) and (b) from Denison et al. (1998) (denoted DMS A amd DMS B). 
The method of Denison et al. (1998) uses a reversible jump MCMC to fit a piecewise cubic 
function, under continuity and differentiability constraints. The MCMC algorithm samples 
from an approximation to the posterior, based on approximating the marginal likelihood 
for each segment. The MCMC procedure takes up to about an order of magnitude longer 
to analyse the data than our approach. 

We compare methods based on MSE as before. Results are given in Table 3. Our method 
does considerably better at estimating the curves which contain discontinuities, as our 
model allows for discontinuities in the underlying curve. While we do similarly or better 
on DMS A and DMS B, our method is substantially worse for the Bumps and Doppler data 
sets. This is due to errors in estimating the peaks in the Bumps data set, and the initial part 
of the curve in the Doppler data set. In both cases these are where the underlying curve 
changes most rapidly. One explanation for this is that using only quadratic polynomials, 
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Heavisine 



Bumps 





DMS (A) DMS (B) 




— 1 : 1 1 1 1 1 I 1 '~f 1 1 1 1 

0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 



Figure 2: Simulated data sets used for comparison with method of Denison et al. (1998) 
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n 


a 


SNR 


DMS 


NEW 


cthresh 


Heavisine 


2048 


1.0 


7 


0.033 


0.022 


0.032 


Blocks 


2048 


1.0 


7 


0.170 


0.016 


0.116 


Bumps 


2048 


1.0 


7 


0.167 


0.318 


0.100 


Doppler 


2048 


1.0 


7 


0.135 


0.198 


0.050 


DMS A 


200 


0.4 


3 


0.010 


0.010 




DMS B 


200 


0.3 


3 


0.009 


0.006 





Table 3: MSE results for 6 test data sets (see Figure 2). For each data set we give the 
number of data points, n, the observation error, a, and the signal-to-noise ratio. MSE 
results for DMS are taken from Denison et al. (1998). 



rather than cubic, makes it harder for our model to fit these parts of the curve. 

The results in Denison et al. (1998) suggest that the DMS method is more accurate than 
using wavelets. We investigated this by calculating mean square errors for estimates ob- 
tained using the complex wavelet method implemented in cthresh. Results are given in 
Table 3 for the four data sets where the number of observations were an integer power 
of 2 (and thus it is straightforward to apply the wavelet approach). We get different re- 
sults from Denison et al. (1998), with the wavelet approach out-performing the other two 
approaches for Bumps and Doppler, and out-performing DMS for Blocks. 



4.3 Power at detecting discontinuities 

Finally we look at the power of our method for detecting discontinuities in the underlying 
curve. Note that it is only our method that can potentially distinguish between change- 
points at which the underlying curve may be either continuous or discontinuous. We focus 
on this feature of our method due to the application of the method we consider in Section 
5. 

We used as a basis the continuous curve in DMS B (see Figure 2). We then introduced 
a discontinuity into the curve. If we denote the underlying DMS B curve by f(x) for 
x G [0, 1], then we introduce a changepoint of size c at point x c to produce the curve: 

\ — $ f( x ^) ~ ca f° r x < Xc ' 
f(x;c,x c ) = ^ f ^ { 0IX > Xc) 

where a 1 is the variance of the observations. We then simulated data centered on this 
curve, and look at the posterior probability of a discontinuous changepoint at between 
[x c — 0.01, x c + 0.01]. We repeated this for different values of c, x c and sample size n. 



19 





Figure 3: The posterior probability of a changepoint within [x c — 0.01, x c + 0.01] for DMS 
A, for different changepoint positions of changepoint c and number of observations 

n. Figure (a) is x c = 0.3, (b) is x c = 0.45, (c) is x c = 0.6 and (d) is x c = 0.7. For each plot 
the lines correspond to different values of n: n = 100 (black full line); n = 200 (red dashed 
line); n = 400 (green dotted line); and n = 800 (blue dot-dashed line). 



Results are given in Figure 3. As expected the posterior probability of a changepoint 
increases with both n and c, and to a lesser extent by the position of the changepoint. The 
lowest posterior probability of a changepoint occurs when x c = 0.45, which is the point at 
which the gradient of the signal is greatest, and this makes jumps in the signal harder to 
infer. In general an average posterior probability of a changepoint of greater than 0.5 can 
occurs with c > 3 when n is 200 or more; and when c > 2 and n = 800. 



5 Well-log Data 



We now apply our method to analyse the well-log data of O Ruanaidh and Fitzgerald 
(1996). The data is shown in Figure 4, and consists of a time-series of measurements of 



20 



rock as a probe is lowered through a bore-hole in the earth's surface. We have scaled time 
so that time-series is over the interval [0, 1]. The underlying signal has a number of abrupt 
changes, due to the changes in rock strata. It is of interest to locate these abrupt changes 
in the signal. See O Ruanaidh and Fitzgerald (1996) and Fearnhead and Clifford (2003) 
for further discussion of this data set, and the practical importance of detecting changes 
in rock strata. Furthermore Fearnhead and Clifford (2003) discuss the need for online 
methods for analysing data of this type. 

Both O Ruanaidh and Fitzgerald (1996) and Fearnhead and Clifford (2003) fit a piece- 
wise constant signal to the data and assume observation error is independent over time. 
However, Fearnhead (2006) suggests that such a model is inappropriate as it ignores local 
variation within segments, and fitting such a model results in the detection of too many 
changepoints. Thus here we will consider analysing the data under our model. The idea 
is that our model is flexible to allow for variation within rock strata through changepoints 
at which the underlying signal is continuous. Changes in rock strata will correspond to 
changepoints at which the underlying signal is discontinuous. Our interest is thus in de- 
tecting the position of these discontinuous changepoints. 

As in 6 Ruanaidh and Fitzgerald (1996) we first remove outliers from the data, and then 
analyse the data in batch. We consider two analyses, one allowing for the possibility of 
changepoints at which the underlying signal is either continuous of discontinuous; and the 
other which only allows changepoints where the underlying signal is discontinuous. The 
latter mimics the models of O Ruanaidh and Fitzgerald (1996) and Fearnhead and Clifford 
(2003). We call these models, model A and model B respectively. 

Results are given in Figure 4. For each model we plot the posterior probability of a 
discontinuity of the signal in an interval [t — 0.001, t + 0.001] for different values of t. For 
simplicity we infer a discontinuity whenever this probability is greater than 0.5, and plot 
the inferred changepoints for the two models. Model B appears to overfit discontinuities 
in the data (posterior mean number of discontinuities, 30, is nearly twice that for model 
A), and using our simple procedure for highlighting changepoints, infers an extra three 
discontinuities in the data - which by eye look spurious. 



6 Discussion 

We have presented a novel and computationally efficient procedure for Bayesian inference 
for changepoint models, where there is Markov dependence in the segment parameters. The 
method is approximate, in that it is based on an approximation to the filtering distribution 
of parameters associated with a new segment. When used with the resampling idea of 
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Figure 4: (Top) Raw well-log data, with inferred discontinuities (red dashed vertical lines: 
both models; blue dot-dashed line: model B only). (Middle and Bottom) Posterior proba- 
bility of discontinuity at time [t — 0.001, t + 0.001] for model A and B respectively. 
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Section 3.3 the resulting algorithm has computational and storage costs that are linear in 
the number of observations. The simulation results in Section 4.1, showed that, for the 
examples we considered, the error introduced by our approximations were negligible. 

One issue with our approach is that it is not simple to quantify the error in the approxima- 
tion. This is a common issue with approximate methods (see the discussion of Rue et al, 
2009). One approach is to use the approximation we develop as a proposal distribution 
within an importance sampling method (Kim et al., 1998). This idea is considered in Liu 
(2007), where it is show that the resulting importance sampling approach can be very 
efficient. 

We demonstrated the potential of this new procedure through the fitting of piece-wise 
quadratic functions. The model we fit allowed for both the possibility of continuity or 
discontinuity at changepoints. Our simulation studies showed that this model is more 
accurate at fitting curves that contain discontinuities than the related method of Denison 
et al. (1998). We also showed that it can also perform better at estimating the underlying 
curve than wavelet procedures, and more accurately characterises the uncertainty in the 
estimate of the curve. Further advantages of our approach is that it can allow for online 
inference, and also can allow for inference about the presence and locationdiscontinuities 
in the underlying signal. 

Appendix 

Here we give details of P s (t,m,() and u s (t,m,() for the piecewise polynomial regression. 
Now denote h = (1, A, A 2 ) where A = (x t — x s+ i) (suppressing the dependence on s and 
t). Then given the most recent changepoint is at time s, the mean of the observation at 
time t is h(3j '. 

Remember ( = (y, 7, //, D), and define (' = (1/, 7', //, D'). Define e = yt — hfi T , Q = 
hDh T + 1, and A = Dh T /Q. Then if (' = u s (t, m, (), we get 

i/ = v+l, 

y = 7 + e 2 /Q, 
fx' — fi + Ae, 
D' = D - A T AQ. 

Furthermore, let T^(x; a, R) denote the density of a student's t random variable d degrees 
of freedonm, and with mean a and scale parameter R. Then we have 

P s (t,m,C) = T u (y t ;hfi T , 1 Q/u). 
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